Dynamic data summarization for hierarchical spatial clustering
Kayumov, Abduaziz, Kim, Min Sik, Shin, Ji Sun
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) finds meaningful patterns in spatial data by considering density and spatial proximity. Because the clustering algorithm is inherently designed for static applications, recent studies have focused on accelerating it in static settings using approximate or parallel methods. However, much less attention has been given to dynamic environments, where even a single point insertion or deletion can require recomputing the clustering hierarchy from scratch due to the need to maintain the minimum spanning tree (MST) over a complete graph. This paper addresses the challenge of enhancing the clustering algorithm for dynamic data. We present an exact algorithm that maintains density information and updates the clustering hierarchy of HDBSCAN during point insertions and deletions. Given the difficulty of scaling the exact algorithm to modern dynamic workloads, we propose an online-offline framework. The online component efficiently summarizes dynamic data using a tree structure, called the Bubble-tree, while the offline step performs the static clustering. Experimental results demonstrate that the data summarization adapts well to fully dynamic environments, providing compression quality on par with existing techniques while significantly improving the runtime performance of the clustering algorithm on dynamic data workloads.
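The abstract leaves the Bubble-tree's internals unspecified; as a rough illustration of the online-offline split it describes, the sketch below maintains flat "bubble" summaries (a count plus a centroid) online, and hands the weighted representatives to an offline clustering run. The class names and the fixed absorption threshold are assumptions for illustration, not the paper's actual data structure.

```python
import math

class Bubble:
    """Summary of nearby points: a count and a linear sum (for the centroid)."""
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)  # linear sum of coordinates

    def centroid(self):
        return [s / self.n for s in self.ls]

    def absorb(self, point):
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, point)]

class OnlineSummarizer:
    """Online step: absorb each arriving point into the nearest bubble
    within `threshold`, or open a new bubble for it."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.bubbles = []

    def insert(self, point):
        best, best_d = None, float("inf")
        for b in self.bubbles:
            d = math.dist(point, b.centroid())
            if d < best_d:
                best, best_d = b, d
        if best is not None and best_d <= self.threshold:
            best.absorb(point)
        else:
            self.bubbles.append(Bubble(point))

    def summary(self):
        """Input for the offline step: weighted representatives
        (centroid, point count) to feed a static clustering run."""
        return [(b.centroid(), b.n) for b in self.bubbles]
```

The offline step would then cluster only the (much smaller) summary, re-running static HDBSCAN as the abstract describes.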
Bi-Filtration and Stability of TDA Mapper for Point Cloud Data
Carlsson, Singh and Memoli's TDA mapper takes a point cloud dataset and outputs a graph that depends on several parameter choices. Dey, Memoli, and Wang developed Multiscale Mapper for abstract topological spaces so that parameter choices can be analyzed via persistent homology. However, when applied to actual data, one does not always obtain filtrations of mapper graphs. DBSCAN, one of the most common clustering algorithms used in the TDA mapper software, has two parameters, \textbf{$\epsilon$} and \textbf{MinPts}. If \textbf{MinPts = 1} then DBSCAN is equivalent to single linkage clustering with cutting height \textbf{$\epsilon$}. We show that if DBSCAN clustering is used with \textbf{MinPts $>$ 2}, a filtration of mapper graphs may not exist except in the absence of free-border points; but such filtrations exist if DBSCAN clustering is used with \textbf{MinPts = 1} or \textbf{2} as the cover size increases, \textbf{$\epsilon$} increases, and/or \textbf{MinPts} decreases. However, the 1-dimensional filtration is unstable. If one adds noise to a data set so that each data point has been perturbed by a distance at most \textbf{$\delta$}, the persistent homology of the mapper graph of the perturbed data set can be significantly different from that of the original data set. We show that we can obtain stability by increasing both the cover size and \textbf{$\epsilon$} at the same time. In particular, we show that the bi-filtrations of the homology groups with respect to cover size and $\epsilon$ between these two datasets are \textbf{2$\delta$}-interleaved.
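The MinPts = 1 equivalence claimed above is easy to check concretely: with MinPts = 1 every point is a core point, so DBSCAN's clusters are exactly the connected components of the graph joining points at distance at most eps, i.e., single-linkage clusters cut at height eps. A minimal union-find sketch (all names illustrative):

```python
import math
from itertools import combinations

def eps_components(points, eps):
    """Single-linkage clustering cut at height eps: union together every
    pair of points at distance <= eps, then report component labels."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]
```

With MinPts = 1, DBSCAN produces these same components and labels no point as noise.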
IPD: An Incremental Prototype based DBSCAN for large-scale data with cluster representatives
Saha, Jayasree, Mukherjee, Jayanta
DBSCAN is a fundamental density-based clustering technique that identifies clusters of arbitrary shape. However, it becomes infeasible when handling big data. On the other hand, centroid-based clustering is important for detecting patterns in a dataset, since unprocessed data points can be labeled with their nearest centroid; however, it cannot detect non-spherical clusters. For large data, it is not feasible to store and compute labels for every sample; this can instead be done as and when the information is required. That purpose is served when clustering acts as a tool to identify cluster representatives and a query is answered by assigning the cluster label of the nearest representative. In this paper, we propose an Incremental Prototype-based DBSCAN (IPD) algorithm designed to identify arbitrary-shaped clusters in large-scale data. Additionally, it chooses a set of representatives for each cluster.
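The query-serving step the abstract describes (label a new point by its nearest cluster representative) can be sketched in a few lines; the function name and the representative format are assumptions, not IPD's API:

```python
import math

def assign_label(point, representatives):
    """Serve a query by returning the cluster label of the nearest
    representative. `representatives` is a list of (coords, label) pairs
    produced by a prior clustering pass."""
    _, label = min(representatives, key=lambda r: math.dist(point, r[0]))
    return label
```

This is what makes the approach cheap at query time: only the representatives, not the full dataset, need to be stored and scanned.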
AMD-DBSCAN: An Adaptive Multi-density DBSCAN for datasets of extremely variable density
Wang, Ziqing, Ye, Zhirong, Du, Yuyang, Mao, Yi, Liu, Yanying, Wu, Ziling, Wang, Jun
DBSCAN has been widely used in density-based clustering. However, with the increasing demand for multi-density clustering, traditional DBSCAN cannot produce good clustering results on multi-density datasets. To address this problem, an adaptive multi-density DBSCAN algorithm (AMD-DBSCAN) is proposed in this paper. An improved parameter adaptation method is proposed in AMD-DBSCAN to search for multiple parameter pairs (i.e., Eps and MinPts), which are the key parameters that determine the clustering results and performance, thereby allowing the model to be applied to multi-density datasets. Moreover, only one hyperparameter is required for AMD-DBSCAN, avoiding complicated repetitive initialization operations. Furthermore, the variance of the number of neighbors (VNN) is proposed to measure the difference in density between clusters. The experimental results show that our AMD-DBSCAN reduces execution time by an average of 75% due to lower algorithm complexity compared with the traditional adaptive algorithm. In addition, AMD-DBSCAN improves accuracy by 24.7% on average over the state-of-the-art design on multi-density datasets of extremely variable density, while incurring no performance loss in single-density scenarios. Our code and datasets are available at https://github.com/AlexandreWANG915/AMD-DBSCAN.
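The abstract names VNN but does not give its formula; one plausible reading, sketched below, is the variance of the per-point count of Eps-neighbors, which is zero for uniformly dense data and grows as densities diverge. This is an assumption for illustration, not the paper's exact definition:

```python
import math

def vnn(points, eps):
    """Variance of the number of eps-neighbors per point (a hypothetical
    reading of the paper's VNN measure)."""
    counts = [
        sum(1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= eps)
        for i, p in enumerate(points)
    ]
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)
```

Under this reading, a high VNN would signal that a single (Eps, MinPts) pair is inadequate and multiple parameter pairs are needed.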
Detecting Point Outliers Using Prune-based Outlier Factor (PLOF)
Babaei, Kasra, Chen, ZhiYuan, Maul, Tomas
Outlier detection (also known as anomaly detection or deviation detection) is the process of detecting data points whose patterns deviate significantly from the rest. Outliers are common in industry applications, where they can be generated by causes such as human error, fraudulent activity, or system failure. Recently, density-based methods have shown promising results, among which the Local Outlier Factor (LOF) is arguably dominant. However, one of the major drawbacks of LOF is that it is computationally expensive. Motivated by this problem, this research presents a novel pruning-based procedure that reduces the execution time of LOF while maintaining its performance. A novel Prune-based Local Outlier Factor (PLOF) approach is proposed in which, prior to employing LOF, the outlierness of each data instance is measured. Next, based on a threshold, the data instances that require further investigation are separated, and the LOF score is computed only for these points. Extensive experiments have been conducted and the results are promising: comparisons with the original LOF and two state-of-the-art variants of LOF show that PLOF produces higher accuracy and precision while reducing execution time.
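The prune-then-score pattern the abstract describes can be sketched generically: a cheap proxy (here, distance to the k-th nearest neighbor; the paper's actual outlierness measure is not specified in the abstract) decides which points receive the expensive LOF-style score, and the rest are declared inliers. All names and the proxy choice are illustrative assumptions:

```python
import math

def prune_then_score(points, k, threshold, expensive_score):
    """PLOF-style pruning sketch: score point i with `expensive_score(i)`
    only if its cheap outlierness proxy exceeds `threshold`; otherwise
    assign 0.0 (treated as an inlier)."""
    results = {}
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        cheap = dists[k - 1]  # distance to the k-th nearest neighbor
        results[i] = expensive_score(i) if cheap > threshold else 0.0
    return results
```

The saving comes from invoking the expensive scorer on only the small flagged subset rather than the whole dataset.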
DIR-ST$^2$: Delineation of Imprecise Regions Using Spatio--Temporal--Textual Information
Tran, Cong, Shin, Won-Yong, Choi, Sang-Il
In the literature, an imprecise region refers to a geographical area without a clearly defined boundary. Previous clustering-based approaches exploit spatial information to find such regions. However, the prior studies suffer from two problems: subjectivity in selecting clustering parameters and the inclusion of a large portion of the undesirable region (i.e., a large number of noise points). To overcome these problems, we present DIR-ST$^2$, a novel framework for delineating an imprecise region by iteratively performing density-based clustering, namely DBSCAN, using not only spatio--textual but also temporal information from social media. Specifically, we aim at finding a proper radius of the circle used in the iterative DBSCAN process by gradually reducing the radius at each iteration, leveraging the temporal information acquired from all resulting clusters. We then propose an efficient and automated algorithm that delineates the imprecise region via hierarchical clustering. Experimental results show that, by virtue of the significant noise reduction in the region, our DIR-ST$^2$ method outperforms the state-of-the-art approach employing a one-class support vector machine in terms of the $\mathcal{F}_1$ score computed against precisely-defined regions regarded as ground truth, and returns noticeably better delineations of imprecise regions. The computational complexity of DIR-ST$^2$ is also shown analytically and numerically.
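The "gradually reducing the radius" loop can be sketched as a simple geometric schedule feeding successive DBSCAN runs; the shrink factor and stopping radius below are assumptions, since the abstract only says the radius decreases each iteration:

```python
def radius_schedule(eps0, shrink, min_eps):
    """Yield the DBSCAN radius for each iteration of an iterative
    shrinking-radius scheme: start at eps0 and multiply by `shrink`
    (0 < shrink < 1) until the radius drops below min_eps."""
    eps = eps0
    while eps >= min_eps:
        yield eps
        eps *= shrink
```

Each yielded radius would drive one DBSCAN pass, with the clusters' temporal statistics deciding when to stop shrinking in the actual framework.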
Novel Density-Based Clustering Algorithms for Uncertain Data
Zhang, Xianchao (Dalian University of Technology) | Liu, Han (Dalian University of Technology) | Zhang, Xiaotong (Dalian University of Technology) | Liu, Xinyue (Dalian University of Technology)
Density-based techniques seem promising for handling data uncertainty in uncertain data clustering. Nevertheless, some issues have not been addressed well in existing algorithms. In this paper, we first propose a novel density-based uncertain data clustering algorithm, which improves upon existing algorithms in the following two aspects: (1) it employs an exact method to compute the probability that the distance between two uncertain objects is less than or equal to a boundary value, instead of the sampling-based method in previous work; (2) it introduces new definitions of core object probability and direct reachability probability, thus reducing the complexity and avoiding sampling. We then further improve the algorithm by using a novel assignment strategy to ensure that every object will be assigned to the most appropriate cluster. Experimental results show the superiority of our proposed algorithms over existing ones.
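For uncertain objects modeled as finite discrete distributions, the probability in aspect (1) can be computed exactly by summing over all outcome pairs, rather than sampling. The discrete-distribution model below is an assumption for illustration; the paper's exact method targets its own uncertainty model:

```python
import math

def prob_within(obj_a, obj_b, eps):
    """Exact P(dist(A, B) <= eps) for two independent uncertain objects
    given as discrete distributions: lists of (coords, probability)
    pairs whose probabilities each sum to 1."""
    return sum(pa * pb
               for a, pa in obj_a
               for b, pb in obj_b
               if math.dist(a, b) <= eps)
```

Quantities like a core-object probability could then be built on top of such pairwise probabilities without any sampling step.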